Abstract

To establish a long-term research facility for further
experimental investigations of design diversity as a means of
achieving fault-tolerant systems, we have designed and
implemented the UCLA DEDIX (DEsign DIversity eXperiment)
system, a distributed testbed for multiple-version
software, at the UCLA Center for Experimental Computer
Science. DEDIX is part of the Center's Olympus Net local
network, which utilizes the Locus distributed operating system
to operate a set of twenty VAX 11/750 computers. DEDIX
will be used in second-generation experiments now being
designed and coordinated at four universities to measure the
efficacy of design diversity and to investigate reliability
increases under large-scale, controlled experimental conditions.
The DEDIX system is described and its application is discussed
in this paper. A review of current research is also presented.

1 Introduction

Originally, fault-tolerant architectures were developed
to tolerate physical faults that are due to random failure
phenomena in the hardware of a computer system. Often,
identical hardware channels are used in simultaneous multiple
computations in order to attain fault-tolerance. The basic
assumption is that the physical faults are uncorrelated. More
recently, the tolerance of design faults, especially in software,
has gained increased attention. Here, it is not possible to use
identical copies, since the same fault will manifest itself in all
of them. Design diversity is the approach in which redundant
hardware and software elements are independently designed to
meet system requirements. These redundant diverse elements
are used in multiple computations in order to tolerate design
faults.

Software design diversity, or N-version programming
[Avi77], is defined as the generation of N ≥ 2 software
"versions" from the same specification. The goal of the
specification is to state the functional requirements completely
and unambiguously, while leaving the widest possible choice of
implementations. Thereafter, versions are independently
written by N programming teams that do not interact with
respect to the programming process. Since the versions are
written independently, it is hypothesized that they are not likely
to contain the same errors, i.e., that errors in their results are
uncorrelated. In a series of small scale experiments, multiple
versions have been executed in parallel and the results from
them have been compared and voted. These "0th generation"
experiments demonstrated the feasibility of the concept and its
effectiveness in dealing with software faults [Chen78]. It was
observed that a major complication, compared to voting on
identical copies, is that the results might be different due to
diversity, but still similar and correct. For example, a floating
point algorithm can be written in several ways yielding slightly
different results. The decision algorithm must accept these
similar results so that a version will not be discarded
unnecessarily. This early research also confirmed the
practicality of experimental investigation and confirmed the
need for high-quality software specifications, since many related
errors could be traced to a poor specification.

The principal aim of the subsequent first generation
research was the investigation of software specification
techniques and the types and causes of software design faults.
Improvements both in software specification techniques, and in
the use of those techniques, were proposed [Kell83].

Planning for the second generation experiments is now
underway. UCLA is cooperating with the University of Illinois,
the University of Virginia, and North Carolina State University
to conduct large scale experiments under the sponsorship of
NASA. Hypotheses on correlated errors have been formulated
and will be validated; the cost effectiveness and the
reliability increase will also be estimated. To establish a
long-term research facility for these second generation
experimental investigations, the DEsign DIversity eXperiment
system (DEDIX), a distributed testbed at the UCLA Center for
Experimental Computer Science, has been designed and
implemented. This paper describes the requirements of DEDIX,
the N-version environment, and the design, implementation,
and current experience with DEDIX. Besides serving as an
experimental vehicle, DEDIX is available as a node with very
high reliability for other users at UCLA.
1.1 DEDIX Functional Requirements

The general functional requirements of DEDIX are:

Distribution: the versions should be able to execute on separate
physical sites in order to take advantage of physical isolation
between sites, to benefit from parallel execution, and to survive
a crash of a minority of sites;

Transparency: the application programmer must not be
required to write special software to take care of the
multiplicity, and a version must be able to run in a system with
an arbitrary value of N without modifications;

Decision algorithm: a reliable decision algorithm that
determines a single decision result from the multiple version
results must be provided. The algorithm must be able to
tolerate and to treat allowable differences in numerical values
and slightly different formats (e.g., misspellings) in
human-readable results;

Environment: DEDIX must run on the distributed Locus
environment at UCLA and must be easily portable to other
Unix systems. DEDIX must be able to run concurrently with
all other normal activities of the local network.

The DEDIX system can in many ways be looked upon as
an extension of the SIFT system [Wens78] that is able to
tolerate both hardware and software faults. Both have the
same type of partitioning, with a decision algorithm at each site
that processes broadcast results, and a Global Executive at
each site that takes consistent reconfiguration decisions. DEDIX
is extended to allow diversity in results and in version execution
times. The SIFT system is a clock (frame) synchronous system
that uses a clock to predict when results should be available for
comparison. This synchronization technique does not allow
diversity in execution times or unpredictable delays in the
communication, both of which can be found in a distributed
N-version environment. Instead, a synchronization protocol is
used in DEDIX, which makes no reference to any notion of
global time within the system.

1.2 Related Research

A second approach to fault-tolerant software is the
recovery block technique, in which alternate software versions
are organized in a manner similar to the dynamic redundancy
(standby sparing) technique used in hardware [Ande81]. The
objective of the recovery block technique is to perform software
design fault detection during runtime by an acceptance test
performed on the results of one version, as opposed to
comparing results from several versions. If the test fails, an
alternate version is executed to implement recovery. This
technique is currently being investigated at several locations
and DEDIX can support the execution of distributed recovery
block programs with relative ease. Several important research
activities related to N-version programming and recovery block
techniques have been reported recently [Ande83, Cris82,
Gmei79, Kim84, Rama81, Voge82].


2 Functional Description of the DEDIX System

2.1 Services and Structure

DEDIX and the diverse program versions together provide
the ability to tolerate software design and implementation
faults. They interact with each other and with their
environment, i.e., a user, so that together they can be viewed
as a system. DEDIX itself does not add any functions to the
system. Its purpose is to enhance the reliability of the system
and to provide a transparent interface to the users, versions,
and input/output system, so that they need not be aware of
multiple versions and recovery algorithms. An abstract view of
a system with N versions is given in Figure 1. Informally
speaking, DEDIX provides the following services:

• it handles requests from the user and distributes them to
all active versions;

• it handles requests from the versions to have results
corrected, and to distribute corrected results to the
versions and to the user;

• it takes decisions on whether or not the results from the
versions agree;

• it takes decisions on whether or not to discard faulty
versions.
Fig. 1. The N sites of DEDIX

Partitioning of DEDIX. The required services of
DEDIX can be mapped either onto a single processor running
all versions sequentially, or onto a multiprocessor system,
running one local version on each processor. If it is mapped
onto a single processor, then the system is vulnerable to some
hardware and software faults that may cause errors in the
operating system or DEDIX software. It is of course possible to
use design diversity here as well, but some hardware faults will
still cause the single shared processor to fail. Also, a
performance penalty is paid if the versions share the same
processor. In a multiprocessor environment, it is possible to
partition the system to protect it against most hardware faults
as well. This can be done by providing each processor with its
own local version, operating system, and decision algorithm.
Some interprocessor communication facility must be common to
all processors in order to be able to exchange results. It should
be noted that the DEDIX design is suitable for any specified
number N ≥ 2 of processors and versions.

The Manifestation of Faults. A hardware or software
fault will affect a program version and it may also affect the
underlying system. DEDIX is designed to be able to identify a
malfunctioning site and to tolerate both cases of fault effects,
provided that the errors can be detected. In the first case, when
the errors and the faults can be isolated to a version only, the
site will attempt to correct the internal state of the local
version with decision results. In the second fault case, the site
usually will not be able to recover by itself and a global
reconfiguration decision is necessary. All faults will manifest
themselves as either "incorrect results" or "missing results".

For example, a "missing" result from a site can be
caused by an erroneous version that is in an infinite loop, a
deadlocked operating system, a hardware fault causing an error
in DEDIX software, etc. A missing result at a site might also
be caused by an excessive communication delay: the result was
produced but never reached the other sites. It is possible to
identify why it is missing. When it is excessively delayed, the
particular sending site will detect the discrepancy between what
it sent and what the other sites observed.

Time-out Function. The only way to detect that a
version did not produce a result when it was expected to, or
that the result is "stuck" somewhere in the communication
system, is to use a time-out function, i.e., to require that a
version must produce a result within a time interval. Two
time-out techniques have been considered. The first technique is
similar to the time-acceptance test in the recovery block
technique. A time-out function is started at the beginning of
each segment of computation and all versions must produce
results within this time interval to pass the time acceptance
test. The length of the interval can either be adjusted to each
segment of computation or set to a "worst case" interval for all
computations.

In the second technique, the time-out interval is started
when a majority of results have arrived at a site. For example,
the time-out is started when the third result arrives in a
configuration with five active versions. This technique is based
on a comparison between relative execution times instead of
using an absolute time, as in the first technique. The time-out
is of course terminated if all results arrive before the time
interval expires. A malfunctioning version sending results too
early will not cause any problems, since they will not start the
time-out. Interestingly, the problem is similar to "comparing
results with skew": the median number (result number 3 out of
5) constitutes the closest to the "ideal value" and the skew
corresponds to the time interval. One advantage of this
technique, compared to the previous, is that there is no need to
assign an individual time-out for each segment of computation.
This is an advantage, since the execution time might depend on
an a priori unpredictable input, which might put the
computation into a loop of long duration.
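
As an illustration only, the second time-out technique can be
sketched in Python as follows. The function name
accepted_results, its parameters, and the representation of
arrival times as plain numbers are assumptions made for this
sketch; they are not taken from the DEDIX implementation.

```python
def accepted_results(arrival_times, timeout):
    """Return the indices of versions whose results arrive in time.

    arrival_times[i] is the arrival time of version i's result at this
    site, or None if the result never arrives (hypothetical encoding).
    The time-out starts when a majority of results has arrived and
    lasts for `timeout` time units.
    """
    n = len(arrival_times)
    majority = n // 2 + 1
    arrived = sorted(t for t in arrival_times if t is not None)
    if len(arrived) < majority:
        # Assumption of this sketch: without a majority the time-out
        # never starts, so no result set is accepted at this site.
        return []
    # The majority-th arrival (e.g. the third of five) starts the clock.
    deadline = arrived[majority - 1] + timeout
    return [i for i, t in enumerate(arrival_times)
            if t is not None and t <= deadline]
```

With five versions and arrival times [1.0, 0.2, 1.1, 1.4, 9.0],
the third arrival (at 1.1) starts the time-out. The early result
at 0.2 does not start it, and with a time-out of 0.5 the result
arriving at 9.0 is treated as missing.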


Both techniques can coexist in DEDIX, and it
might depend on the application, input/output, the computing
environment, and real time requirements, which type is
preferable. Both techniques require that redundant computations
should start almost at the same time at each processor and that
redundant user input also must arrive within the time interval.
In the current implementation, the latter technique is
implemented, due to the implementation environment and type
of computations. The time interval is set by the user and can
be quite wide, since all versions are suspended until the time
interval has expired or until all results have arrived. This
implementation is possible since currently there are no real
time requirements within DEDIX. The system would need some
modifications to accommodate the time-acceptance test
technique.

Layered Design. The layered design of DEDIX has
many advantages. One of the most important is that it reduces
complexity. The purpose of each layer is to offer certain
services to the higher layers, shielding the higher layers from
the details of how the offered services are actually implemented.
Each layer adds new services to those provided by the lower
layers. The structures and algorithms of one layer are not
visible outside that layer. For example, a layer can provide a
fault-tolerant service that includes redundancy and algorithms
for fault-detection and recovery. Another important advantage
is that the implementation at a given layer can be changed
without affecting the other layers, provided the service of the
layer is unchanged.

DEDIX is designed as a set of hierarchically structured
layers to reduce complexity. Each site has an identical set of
layers and entities. These layers, from the bottom to the top,
are: Transport Layer, Synchronization Layer, Decision and
Executive Layer, and Version Layer. Each layer provides a set
of services, which are described below and shown in Figure 2.



Fig. 2. The layers at site i of DEDIX
2.2 The Transport Layer

Purpose: This layer controls the communication of
messages (containing the results) between the sites. Messages
are broadcast to all active sites. The layer makes sure that no
message is lost, duplicated, damaged, or misaddressed, and it
preserves the ordering of sent messages. A disconnection is
reported to the layer above.

Comments: Since there is no such thing as a fault-free
connection, the Transport Layer must be identified with the
likelihood that a message is lost or damaged, i.e., the reliability
of the service must be stated. Also of interest for the higher
layers are its response time and throughput. The Transport
Layer is expected to use a redundant underlying communication
structure to meet the reliability requirements.

Implementation: Currently, a single ring structure of
interprocess UNIX pipes is used. Since this implementation
does not allow a site crash, a redundant interconnection
structure is under implementation. The initial ring
implementation provided us with some determinism in the
system, which made it much easier to observe and debug the
Transport Layer.

2.3 The Synchronization Layer

Purpose: For each physically distributed site, this layer
broadcasts results (using the Transport service) and collects
messages with the results (cc-vectors) from all other sites. The
layer only accepts results that are both broadcast within a
certain time interval and that will arrive within the same time
interval. The collected results are delivered to the Decision
function. The layer accepts a new set of results when every site
has confirmed that all or enough of the previous results have
been delivered.

Comments: The processors need to be event-synchronized
in order to ensure that results from corresponding
computations are compared. Otherwise, if two sets of results
from two different computations are compared, the Decision
algorithm might wrongly conclude that some of the processors
are faulty. Traditionally, this synchronization has been obtained
by referring to a common clock or set of clocks. The SIFT
system [Mell82] is one example of such a clock synchronous
system. In SIFT it is predicted when the results should be
available for a comparison. To ensure that the results are
available in SIFT, several design measures are taken to
eliminate all unpredictable delays, such as using a fully
connected communication structure, using strict periodic
scheduling, not allowing external interrupts (only clock
interrupts are allowed for scheduling), and regularly
synchronizing the clocks.

The underlying distributed system and the versions have
the following characteristics which make the clock synchronous
technique difficult to use or impractical in DEDIX:

• the versions have different execution times between the
cross-check points;

• the versions will run concurrently with other network
activities, which means that processors temporarily can
be heavily loaded, and hence prolong the time to
execute some versions;

• the communication network has inherently varying
transport delays of messages.

Implementation: A synchronization protocol is designed
to provide the service. It ensures that the results that are
compared by the Decision function are from the same cross-
check (cc) point in each version. The versions are stopped until
all of them have reached the same cc-point, and they are not
started again until the results are exchanged and a decision is
made. To be able to detect versions that are in an infinite loop
and to allow slow versions to catch up, the previously
mentioned time-out technique is used by the protocol.

Using this protocol means that the synchronization of 
the system is based on: 

• the fact that correctly working versions must produce
exactly the same number of cc-vectors;

• that correctly working versions have similar execution
times, i.e., they will produce results within or before a
specified time-out interval;

• that "missing" or disagreeing results do not exist at a
majority of sites.

Each site has both a Sender and a Receiver entity in
this layer, which communicate with the other sites'
corresponding entities according to the protocol. The Receiver
entity collects messages from the Senders and it delivers them
to the Decision function. After the delivery, it sends
acknowledgements back to the Senders to confirm the delivery.
When a Sender entity has collected acknowledgements from all
the other sites or when it has at least a majority of
acknowledgements, it will indicate this to the Decision and
Executive layer. This indication is used to restart the Versions.
It might seem unnecessary to use acknowledgements, since a
Receiver can inform the Sender that it has received enough
results. However, the Receivers might have an inconsistent
view on the number of received results. For example, in a three
site environment, two sites might get three results immediately
while the third site only gets two. The third site cannot yet
accept a new set of results. By using this indication, it is
ensured that all Receivers are ready to accept a new set of
results. The specification and verification of the protocol is
described in a companion paper [Gunn85].
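
The acknowledgement rule for a Sender entity might be sketched
as below. This is not the verified protocol of the companion
paper. The class and method names are hypothetical, and two
points are assumptions of the sketch: that a bare majority is
accepted only once the time-out has expired, and that the local
site counts towards the majority.

```python
class Sender:
    """Tracks acknowledgements for one broadcast cc-vector (illustrative)."""

    def __init__(self, other_sites):
        self.other_sites = set(other_sites)
        self.acked = set()

    def on_ack(self, site):
        # Acknowledgements from unknown (e.g. disconnected) sites are ignored.
        if site in self.other_sites:
            self.acked.add(site)

    def all_acked(self):
        return self.acked == self.other_sites

    def majority_acked(self):
        # Assumption: the majority is counted over all sites, and the
        # local site is counted as having delivered its own result.
        total = len(self.other_sites) + 1
        return len(self.acked) + 1 > total // 2

    def may_restart(self, timed_out):
        # Assumption: a bare majority suffices only after the time-out.
        return self.all_acked() or (timed_out and self.majority_acked())
```

In the three-site example above, a Sender that has heard from
only one of the two other sites would keep its version suspended
until the second acknowledgement arrives or the time-out expires.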

2.4 The Decision and Executive Layer 

Purpose: This layer receives results (specified as "cross-
check vectors") from its local version, takes agreement
decisions on results received from all other versions and
delivered by the Synchronization layer, determines whether the
local version is faulty or not, and makes recovery decisions.
Corrected results are forwarded to the local version. It controls
the input/output of the local version. All exceptions that cannot
be handled elsewhere are directed to this layer.


Implementation: The layer has four entities: a Sender, a
Decision function, and two entities for controlling the recovery
process, a Local Executive and a Global Executive. The Sender
entity receives the requests from the local version and it
responds to the version when a decision has been taken with
corrected results. There are four different types of normal
comparison requests: intermediate state cc-vector, output cc-
vector, input, and version termination. All of them are
broadcast to the other sites, and run through the Decision
function to ensure consistency and synchronization. At an input
request, a decision is first taken on the format vector before the
actual input read is performed. When the local version has
raised an exception from which it cannot recover, it will use the
abnormal exception request, which is immediately directed to
the Local Executive.

The Sender entity attaches an occurrence number to the
current cc-identifier (cc-id) of the cc-vector. The occurrence
number is used to uniquely identify a cc-id, since the same cc-id
will appear in loops and other repeated program sequences.
Each time a version requests a decision, the occurrence
number for that cc-id request is incremented.
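
A minimal sketch of this bookkeeping is given below. The class
name OccurrenceCounter and the method tag are hypothetical; the
paper does not state whether counting starts at zero or one, so
starting at one is an assumption.

```python
from collections import defaultdict


class OccurrenceCounter:
    """Pairs each cc-id with a per-id occurrence number (illustrative)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def tag(self, cc_id):
        """Return (cc_id, occurrence) for the next decision request at cc_id."""
        self._counts[cc_id] += 1  # assumption: the first occurrence is 1
        return (cc_id, self._counts[cc_id])
```

A cc-point inside a loop would then yield ("loop", 1),
("loop", 2), and so on, so that results from different
iterations are never compared with each other.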

Input/Output System. The input/output system to the
versions is designed to be replicated as well. However, in the
current implementation a centralized terminal connection is
used for all input/output, and the data to be printed or read is
distributed to all versions. They will view the data as replicated.
The interface between DEDIX and the input/output system is
similar to the interface between DEDIX and the local version.
For example, a read from a terminal might be timed-out if it
does not respond in time with other terminals, and the
output data is run through the Decision function before the
actual output. The request to read data is also run through the
Decision function to ensure consistency.

The Global Executive. The purpose of the Global
Executive is to: a) collect error reports from the Decision
function and the Local Executive, b) exchange error reports
with every other active Global Executive, and c) decide on a
new reconfiguration, based on all error reports. The current
implementation is rudimentary. The functions of the Global
Executive are basically the same as in the SIFT system. One
difference is that the SIFT executive is invoked at predefined
time intervals at all sites. This is not possible in the DEDIX
system, since the sites might have a different state of
computation at the same time. Instead, the Global Executive is
invoked after a preset number of exchanges of results (=
number of decisions) has taken place. The number of exchanges
is the only consistent computation state in all sites. That is, by
referring to this number, it is possible to ensure that all
correctly working sites will exchange error reports and decide
on a reconfiguration at the same state of computation. This
number is kept consistent by the synchronization protocol.

Error Reports. Every Local Executive has an error
report table, with one entry for each site. In the current
implementation this entry is an error counter for that site. The
Local Executive increments the counter for a site each time
that the site has either a disagreeing or missing result. This
means that the Local Executive does not distinguish between a
missing result and a delayed result. Since sites might get
different numbers of results due to varying communication
delays, sites may have slightly different error reports.
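
The counter update can be illustrated as follows. The function
name update_error_table and the dictionary layout are
assumptions of this sketch, and plain equality stands in for the
Decision function's notion of agreement.

```python
def update_error_table(table, decision, results):
    """Increment the counter of each site with a missing or disagreeing result.

    table    -- dict mapping site name to error count (mutated in place)
    decision -- the decision result of this exchange
    results  -- dict mapping site name to its result; a site whose
                result is missing is absent or mapped to None
    """
    for site in table:
        result = results.get(site)
        # A missing result and a late result look the same here, so a
        # delayed site is charged exactly like a silent one.
        if result is None or result != decision:
            table[site] += 1
    return table
```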

The control is moved to the Global Executive when an
exchange of results should take place. The Global Executive at
a site will temporarily take the place of the local version and
use the broadcasting and decision functions of the underlying
layers. The Error Report is put into a regular message and
delivered to the Synchronization layer, which does not perceive
the difference. The Synchronization layer collects messages as
usual and they are run through the Decision function in order
to ensure that every site has a consistent view on the error
reports.

The Reconfiguration Decision. The Global Executive
will get a consistent error report on which it decides on the
reconfiguration. In the current implementation, only a
degradation can be done. It is not possible to start an inactive
site. A site is proposed to be disconnected if the number in the
error report's counter exceeds a predefined threshold value, say
50% of all exchanges. All Global Executives propose a new
configuration that is also broadcast to every other site and run
through the Decision function. The proposed configurations are
voted on bit-by-bit, which will ensure a consistent view on a new
configuration at every correctly working site.
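
The threshold test and the bit-by-bit vote could look roughly
like the sketch below. The encoding of a configuration as a list
of bits (1 for active, 0 for disconnected), the function names,
and the treatment of a tied vote as a disconnection are all
assumptions made for illustration.

```python
def propose_configuration(error_counts, exchanges, threshold=0.5):
    """Bit i is 1 if site i stays active, 0 if it should be disconnected."""
    limit = threshold * exchanges
    return [0 if count > limit else 1 for count in error_counts]


def vote_bitwise(proposals):
    """Majority vote on each bit position across all proposals."""
    n = len(proposals)
    # Assumption: a tie (possible only for even n) disconnects the site.
    return [1 if 2 * sum(bits) > n else 0 for bits in zip(*proposals)]
```

If two of three sites propose [1, 0, 1] and a faulty third site
proposes [1, 1, 1], the vote still yields [1, 0, 1] at every
correctly working site.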

A degradation means that the Local Executive instructs
the receiving entities to stop listening to that site, or if the
faulty version is local to the same site, to terminate the version
and to quit sending messages. The new number of expected
results is adjusted accordingly. After a site is degraded, it will
still collect messages and operate input/output, but it will not
deliver them to the local version, provided that the fault only
affects the version.

The Local Executive. The Local Executive is activated
when the Decision function indicates that the result is not
unanimous, or when some unrecoverable exception is signaled
from the local version or some other layer. The Local
Executive will first try to recover locally from the fault before it
either reports the problem to the Global Executive or, if it is
considered as fatal to the site, closes down the site. There are
three classes of exceptions that are considered, as described
below.

Functional exceptions are specified in the functional description
of DEDIX and they are independent of the implementation.
Among them are the exceptions raised from a non-unanimous
result, when a communication link is disconnected, and when a
cc-vector is completely missing. For these exceptions the Local
Executive will attempt to keep the site active, possibly
terminating the local version, while keeping the input/output
operating.

Implementation exceptions are dependent on the specific
computer system, language, and implementation technique
chosen. All UNIX signals, like segmentation faults, process
termination, invalid system call, etc., belong to this class.
Other examples are all the exceptions defined in DEDIX, like
signaling when a function is called with an invalid parameter,
or when an inconsistent state exists. Most of these exceptions
will force an orderly close down in order to be able to provide
data for analysis.

Exceptions generated by the local version. The local version
program may include facilities for exception handling and some
of the exceptions may not be recoverable within the version.
These exceptions are sent to the layer as requests. The Local
Executive will terminate the local version while keeping the site
alive.

2.5 The Version Layer

The purpose of this layer is to interface the i-th (local)
version with the DEDIX system and to correct the state of the
variables that are incorrect according to the Decision function.
The function doing the interfacing is called the Cross-Check, or
CC-Function, since it is called as a function by the version at
each cc-point. Pointers to the results to be corrected are sent as
parameters to this function. The CC-Function transfers the
version representation of results into a cc-vector so that the
internal representation of a cc-vector in DEDIX is hidden from
the version program. The CC-Function writes back the corrected
results into the version.

2.6 The Decision Algorithm 

The Decision Algorithm is used to determine a single
decision result from the N-version results. The Decision
Algorithm may utilize only a subset of all N results for a
decision; for example, the first result that passes an acceptance
test may be chosen. In the case that a decision result cannot be
determined, a higher level recovery procedure may be invoked.

In DEDIX we have implemented a generic Decision
Algorithm which may be replaced by user written routines
provided that the interfaces are preserved. This allows
application-specific decision algorithms to be incorporated in
those cases where the default mechanisms are inappropriate; for
example, this may occur because of lack of sensitivity, or
unnecessary elimination of program versions.

The generic Decision Algorithm is hierarchical in
nature. The algorithm attempts to determine a decision by
applying the following major decision classes sequentially:

(1) bit by bit - identical match only;

(2) cosmetic - detecting character string differences caused
by misspelling or character substitution;

(3) numeric - integer and real number decisions.

All numeric decisions use a median value and it can be
proved that, so long as the majority of versions are not faulty,
the median of all responses is acceptably close to a supposed
ideal value. Numeric values are allowed to be different within
some "skew interval", thus allowing results to be non-identical
but still similar.
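
A numeric decision of this kind might be sketched as follows.
The function name numeric_decision, the use of the lower median
for an even number of results, and the rule that a decision
requires a majority of values within the skew interval are
assumptions of this sketch rather than details given for the
generic Decision Algorithm.

```python
import statistics


def numeric_decision(values, skew):
    """Return (decision, agreeing) for a list of numeric results.

    The decision is the median of the received values. A version
    agrees if its value lies within `skew` of the median. If no
    majority agrees, (None, []) is returned so that a higher level
    recovery procedure can be invoked.
    """
    # median_low always returns one of the submitted values.
    median = statistics.median_low(values)
    agreeing = [i for i, v in enumerate(values) if abs(v - median) <= skew]
    if 2 * len(agreeing) <= len(values):
        return None, []
    return median, agreeing
```

For the results [1.000, 1.001, 0.999, 5.0, 1.002] and a skew
of 0.01, the median is 1.001 and only the fourth version
disagrees; it would be a candidate for correction or for an
error report.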
2.7 User Interface

The user interface of DEDIX allows users to debug the system
as well as the versions, monitor the operations of the system,
apply stimuli to the system, and collect empirical data during
experimentation.

Break point. The break command enables the user to set
breakpoints. At a breakpoint, DEDIX stops executing the
versions and goes into the user interface where the user can
enter commands to examine the current system states, examine
past execution history, or inject stimuli to the system. The
remove command deletes breakpoints set by the break
command. The continue command resumes execution of the
versions at a breakpoint. The user may terminate execution
using the quit command.

Monitoring. The user can examine the current contents of the
messages passing through the Transport layer by using the
display command. Since every message is logged, the user may
also specify conditions in the display command to display any
message logged in the past. The user can also examine the
internal system states by using the show command, e.g., to
examine the breakpoints which have been set, the results of
voting, etc.

Stimuli Injection. The user is allowed to inject faults to the
system by changing the system states, e.g., the cc-vector, by
using the modify command.

Statistics Collection. The user interface gathers empirical data 
and collects statistics of the experiments. Every message 
passing the Transport layer is logged into a file with a 
time-stamp. This enables the user to do post-execution analysis or 
even replay the experiment. Statistics such as elapsed time, system 
time, the number of cc-points executed, and the results of their 
decisions are also collected. 
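
The time-stamped message log and the conditional display command can be illustrated with a short sketch. It is written in Python for brevity and does not reproduce the DEDIX log format. The record fields, the JSON-lines encoding, and the names `log_message` and `display` are assumptions made for the example.

```python
import io
import json
import time


def log_message(log, sender, receiver, payload, clock=time.time):
    """Append one time-stamped record per message (hypothetical format)."""
    record = {"time": clock(), "from": sender, "to": receiver,
              "payload": payload}
    log.write(json.dumps(record) + "\n")


def display(log, condition=lambda record: True):
    """Return every logged record that satisfies `condition`, in log order."""
    log.seek(0)  # reread the log from the beginning
    return [r for r in map(json.loads, log) if condition(r)]


# Usage: log two messages, then display only those sent by "site1".
log = io.StringIO()  # stands in for the log file
log_message(log, "site1", "site2", {"cc_point": 1, "value": 3.14})
log_message(log, "site2", "site1", {"cc_point": 1, "value": 3.15})
from_site1 = display(log, lambda r: r["from"] == "site1")
```

Because the records are kept in arrival order with their time-stamps, the same file can serve both for post-execution analysis and as the input to a replay.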

3 Experimental Goals of the DEDIX Testbed 

The second generation experiments at UCLA have two 
fundamental goals: the investigation and evaluation of various 
fault-tolerance mechanisms, and the analysis and 
characterization of the fault distributions of highly reliable 
program versions. 

3.1 Fault Tolerance Mechanisms 

We expect to obtain quantitative experimental results 
about the effectiveness of the fault tolerance mechanisms. We 
also plan to evaluate the possible loss of performance due to the 
operation of the fault-tolerance mechanisms in the absence of 
faults, as well as the cost of error recovery. 

A problem area that is being thoroughly investigated is the 
recovery of failed versions through backward and forward 
recovery, and reinitialization. Since we assume that all versions 
are likely to contain design faults, it is critical to be able to 
recover these versions as they fail, rather than merely degrade 
to N-1 versions, then N-2 versions and so on. A pilot 
experiment is underway in which failed versions are recovered 
without requiring the specification of the entire internal state. 

An important and interesting application area that often 
requires very high reliability and availability is real-time 
execution of time-critical applications. However, the current 
implementation using Locus is likely to be too slow for this 
purpose. Despite this limitation of Locus, its functional 
architecture can be used with faster transport service and faster 
scheduling policies in a real-time system, while Locus can be 
used to simulate real-time execution. 

We will investigate the effectiveness of design diversity 
as a means of increasing software reliability within a 
constrained budget. We are interested in the costs of removing 
bugs and of enhancement. By combining relatively unverified, 
unvalidated software versions to produce highly reliable 
multi-version software we may be able to decrease cost while 
increasing reliability. Most errors in the software versions will 
be detected by the Decision Algorithm during on-line 
productive use of the system. The software faults then can be 
fixed while limiting their impact on system availability. 

Enhancing multiple software versions is more difficult. 
[...] should be sufficiently modular and [...] so 
that enhancement will generally affect few modules. The 
extent to which each module is affected can then be used to 
determine whether (1) existing versions should be modified to 
reflect the enhancement, (2) existing versions should be 
discarded and new versions produced, or (3) new versions 
should be produced to implement the enhancements and old 
versions kept to implement the original requirements. 
Experiments will be conducted to gain insights into the criteria 
to be used for a choice. 


In the "mail-order" concept, members of fault tolerance 
research groups at several universities will write software 
versions for use in a large experiment. We expect that software 
versions produced at geographically separate locations, by 
people with different experience who use different 
programming languages, will contain substantial design 
diversity. It may be possible to utilize the rapidly growing 
population of computer hobbyists on a contractual basis to 
provide individual software versions at their own locations. This 
would not require a large concentration of skilled people and 
would allow for the loss of individual programmers. 

3.2 The Fault Distributions of Highly Reliable Versions 

The other major goal of the second generation 
experiments concerns the distribution of faults in highly reliable 
program versions. A recent theoretical analysis of redundant 
software has claimed that there are major differences between 
the models needed to describe redundant software faults and 
independent hardware faults [Eckh85]. Indeed, a clear need 
was seen for empirical data to truly assess the effects of errors 
on highly reliable software systems. 

A model experiment has been specified in which IS 
general guidelines and 10 specific tasks are identified [Kell82], 
and second-generation experimentation is now underway at 
four universities (UCLA, University of Virginia, University of 
Illinois and North Carolina State University) [Avi84] to 
measure the efficacy of design diversity and to demonstrate 
potential reliability increases under large-scale, controlled 
experimental conditions. 


We intend to produce these software versions under 
controlled conditions that approximate the development 
methodologies and environments used by advanced industrial 
facilities. We will conduct extensive logging of work periods 
and events such as error discovery, specification questions and 
answers, and test suite execution. The experimenters will 
provide a complete high-level external specification. At all 
stages, questions about the specification will be submitted by 
electronic mail, reviewed by the experimenters, and responded 
to by electronic mail. The determination that a question 
reveals a flaw in the specification will cause a change to be 
broadcast to all programmers at all sites. The deliverable items 
will include a design document, a series of compiled programs 
representing the results of the top-down development at each 
abstraction layer, a test plan and test log, and the final 
program. The delivered software is then subjected to an 
acceptance test. We will study fault distributions by conducting 
extensive testing of the versions with randomly generated test 
data. The nature and cause of all detected errors will be 
analyzed. 
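
One way to tabulate the outcome of such random testing is to count, for each test input, how many versions fail at once, since coincident failures are what distinguish redundant software from independent hardware channels. The sketch below is an illustration in Python and not part of the experiment plan. The function name, the use of a reference `oracle` to judge correctness, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter


def coincident_failure_histogram(versions, oracle, make_input, trials, seed=0):
    """Count how many versions fail together on each random test input.

    versions: callables under test, one per program version.
    oracle: callable giving the expected result (assumed to be available).
    make_input: callable taking a random.Random and returning a test input.
    Returns a Counter mapping "number of failed versions" to "test cases".
    """
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    histogram = Counter()
    for _ in range(trials):
        x = make_input(rng)
        expected = oracle(x)
        # A version fails on x if its result differs from the expected one.
        failures = sum(1 for version in versions if version(x) != expected)
        histogram[failures] += 1
    return histogram


# Usage: three "versions" of absolute value, one of them faulty for x < 0.
versions = [abs, abs, lambda x: x]
hist = coincident_failure_histogram(
    versions, abs, lambda rng: rng.randint(-5, 5), trials=200)
```

In the resulting histogram, entries for two or more simultaneous failures are the cases that a majority decision among three versions cannot mask.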

4 Specification Issues 

Significant progress has occurred in the development of 
formal specification languages since our previous experiments 
[Avi84]. Our current goal is to compare and assess the 
applicability to practical use by application programmers of the 
following formal program specification methods: 

(1) The CLEAR specification language developed at 
Edinburgh University and SRI International; [Burs81] 

(2) The LARCH family of specification languages 
developed at Xerox Palo Alto Research Center and at 
M.I.T.; [Gutt83] 

(3) The OBJ specification language developed at UCLA; 
[Gogu79] 

(4) The Ina Jo specification language developed at SDC; 
[Loca80] 

(5) The "M" specification language, descended from "Z"; 
[Meye84] 

(6) The applicability of Concurrent Prolog as a method of 
formal specification. 

The study focuses on the assessment of the following 
aspects of the specification languages: (1) The purpose and 
scope (problem domain); (2) Completeness of development; (3) 
Quality and extent of documentation; (4) Existence of support 
environments; (5) Executability and suitability for rapid 
prototyping; (6) Provision of notation to express timing 
constraints and concurrency; (7) Methods of specification for 
exception handling; (8) Extensibility for the specification of 
fault-tolerant multi-version software. 

The outcome of the study will be the selection of two or 
more specification languages for the subsequent experimental 
assessment of their applicability in the design of fault-tolerant 
multi-version software. Two major elements of the experiment 
will be: 

(1) The concurrent verification of the specifications by 
symbolic execution and mutual interplay. 

(2) An assessment of the practical applicability of the 
specifications, as they are used by application 
programmers in an N-version software experiment. 

The next step in DEDIX development will be a formal 
specification of parts of the current DEDIX prototype 
(implemented in C): the Synchronization layer, the Decision 
function, and the Local and Global Executives. The 
specification will provide an executable prototype of the 
DEDIX supervisory operating system as well as the application 
versions. This functional specification should allow not only the 
migration to real-time systems, but also the use of multi-version 
software techniques for the fault-tolerance mechanisms of 
DEDIX themselves. The goal is a DEDIX system that supports 
design diversity in application programs and which is itself 
diverse in design at each site. 

Independent specifications of some DEDIX system [Cris82] 
modules in two or more formal languages will serve to compare 
the merits of the methods. Further research is planned in the 
application of dual diverse formal specifications to eliminate 
similar errors traceable to specification faults and to increase 
the dependability of the specifications. [Eckh85] 


5 Conclusion 


This paper has presented an overview of a major effort 
to develop a research environment for software design diversity 
research at UCLA. The complete DEDIX prototype has been 
implemented, and second-generation experiments are 
underway. Several other research efforts also have been 
initiated. 